The purpose of this notebook is to give data locations, data ingestion code, and code for rudimentary analysis and visualization of COVID-19 data provided by New York Times, [NYT1].
The following steps are taken:
Ingest data
Take COVID-19 data from The New York Times, based on reports from state and local health agencies, [NYT1].
Take USA counties records data (FIPS codes, geo-coordinates, populations), [WRI1].
Merge the data.
Make data summaries and related plots.
Make corresponding geo-plots.
Note that other, older repositories with COVID-19 data exist, like, [JH1, VK1].
Remark: The time series section is done for illustration purposes only. The forecasts there should not be taken seriously.
From the help of tolower:
capwords <- function(s, strict = FALSE) {
cap <- function(s) paste(toupper(substring(s, 1, 1)),
{s <- substring(s, 2); if(strict) tolower(s) else s},
sep = "", collapse = " " )
sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}
if( !exists("dfNYDataStates") ) {
dfNYDataStates <- read.csv( "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv", stringsAsFactors = FALSE )
colnames(dfNYDataStates) <- capwords(colnames(dfNYDataStates))
}
head(dfNYDataStates)
dfNYDataStates$DateObject <- as.POSIXct(dfNYDataStates$Date)
summary(as.data.frame(unclass(dfNYDataStates)))
Date State Fips Cases Deaths DateObject
2020-04-09: 56 Washington : 83 Min. : 1.00 Min. : 0 Min. : 0.00 Min. :2020-01-21 00:00:00
2020-04-10: 56 Illinois : 80 1st Qu.:17.00 1st Qu.: 11 1st Qu.: 0.00 1st Qu.:2020-03-12 00:00:00
2020-04-11: 56 California : 79 Median :31.00 Median : 144 Median : 2.00 Median :2020-03-23 00:00:00
2020-04-12: 56 Arizona : 78 Mean :31.11 Mean : 2515 Mean : 75.63 Mean :2020-03-20 12:22:07
2020-03-28: 55 Massachusetts: 72 3rd Qu.:46.00 3rd Qu.: 1066 3rd Qu.: 21.00 3rd Qu.:2020-04-02 00:00:00
2020-03-29: 55 Wisconsin : 68 Max. :78.00 Max. :188694 Max. :9385.00 Max. :2020-04-12 00:00:00
(Other) :1939 (Other) :1813
Summary by state:
by( data = as.data.frame(unclass(dfNYDataStates)), INDICES = dfNYDataStates$State, FUN = summary )
Alternative summary:
Hmisc::describe(dfNYDataStates)
if(!exists("dfNYDataCounties") ) {
dfNYDataCounties <- read.csv( "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv", stringsAsFactors = FALSE )
colnames(dfNYDataCounties) <- capwords(colnames(dfNYDataCounties))
}
head(dfNYDataCounties)
dfNYDataCounties$DateObject <- as.POSIXct(dfNYDataCounties$Date)
summary(as.data.frame(unclass(dfNYDataCounties)))
Date County State Fips Cases Deaths DateObject
2020-04-12: 2679 Washington: 682 Georgia : 3182 Min. : 1001 Min. : 0.0 Min. : 0.000 Min. :2020-01-21 00:00:00
2020-04-11: 2660 Unknown : 649 Texas : 3134 1st Qu.:17161 1st Qu.: 2.0 1st Qu.: 0.000 1st Qu.:2020-03-26 00:00:00
2020-04-10: 2629 Jefferson : 519 Virginia : 2205 Median :28135 Median : 5.0 Median : 0.000 Median :2020-04-02 00:00:00
2020-04-09: 2595 Franklin : 478 Indiana : 1868 Mean :29529 Mean : 106.2 Mean : 3.179 Mean :2020-03-31 07:50:28
2020-04-08: 2565 Jackson : 423 North Carolina: 1855 3rd Qu.:42125 3rd Qu.: 22.0 3rd Qu.: 1.000 3rd Qu.:2020-04-07 00:00:00
2020-04-07: 2539 Montgomery: 398 Tennessee : 1813 Max. :56043 Max. :103208.0 Max. :6717.000 Max. :2020-04-12 00:00:00
(Other) :38181 (Other) :50699 (Other) :39791 NA's :716
if(!exists("dfUSACountyData")){
dfUSACountyData <- read.csv( "https://raw.githubusercontent.com/antononcube/SystemModeling/master/Data/dfUSACountyRecords.csv", stringsAsFactors = FALSE )
}
head(dfUSACountyData)
summary(as.data.frame(unclass(dfUSACountyData)))
Country State County FIPS Population Lat Lon
UnitedStates:3143 Texas : 254 WashingtonCounty: 30 Min. : 1001 Min. : 89 Min. :19.60 Min. :-166.90
Georgia : 159 JeffersonCounty : 25 1st Qu.:18178 1st Qu.: 10980 1st Qu.:34.70 1st Qu.: -98.23
Virginia: 134 FranklinCounty : 24 Median :29177 Median : 25690 Median :38.37 Median : -90.40
Kentucky: 120 JacksonCounty : 23 Mean :30390 Mean : 102248 Mean :38.46 Mean : -92.28
Missouri: 115 LincolnCounty : 23 3rd Qu.:45082 3rd Qu.: 67507 3rd Qu.:41.81 3rd Qu.: -83.43
Kansas : 105 MadisonCounty : 19 Max. :56045 Max. :10170292 Max. :69.30 Max. : -67.63
(Other) :2256 (Other) :2999
dsNYDataCountiesExtended <-
dfNYDataCounties %>%
dplyr::inner_join( dfUSACountyData %>% dplyr::select_at( .vars = c("FIPS", "Lat", "Lon", "Population") ), by = c( "Fips" = "FIPS" ) )
dsNYDataCountiesExtended
ParetoPlotForColumns( dsNYDataCountiesExtended, c("Cases", "Deaths"), scales = "free" )
cf <- colorBin( palette = "Reds", domain = log10(dsNYDataCountiesExtended$Cases), bins = 10 )
m <-
leaflet( dsNYDataCountiesExtended[, c("Lat", "Lon", "Cases")] ) %>%
addTiles() %>%
addCircleMarkers( ~Lon, ~Lat, radius = ~ log10(Cases), fillColor = ~ cf(log10(Cases)), color = ~ cf(log10(Cases)), fillOpacity = 0.8, stroke = FALSE, popup = ~Cases )
m
An alternative of the geo-visualization is to use a heat-map plot.
Make a heat-map plot by sorting the rows of the cross-tabulation matrix (that correspond to states):
matSDC <- xtabs( Cases ~ State + Date, dfNYDataStates, sparse = TRUE)
d3heatmap::d3heatmap( log10(matSDC+1), cellnote = as.matrix(matSDC), scale = "none", dendrogram = "row", colors = "Blues") #, theme = "dark")
Cross-tabulate states with dates over deaths and plot:
matSDD <- xtabs( Deaths ~ State + Date, dfNYDataStates, sparse = TRUE)
d3heatmap::d3heatmap( log10(matSDD+1), cellnote = as.matrix(matSDD), scale = "none", dendrogram = "row", colors = "Blues") #, theme = "dark")
In this section we do simple “forecasting” (not a serious attempt).
Make time series data frame in long form:
dfQuery <-
dfNYDataStates %>%
dplyr::group_by( Date, DateObject ) %>%
dplyr::summarise_at( .vars = c("Cases", "Deaths"), .funs = sum )
dfQueryLongForm <- tidyr::pivot_longer( dfQuery, cols = c("Cases", "Deaths"), names_to = "Variable", values_to = "Value")
head(dfQueryLongForm)
Plot the time series:
ggplot(dfQueryLongForm) +
geom_line( aes( x = DateObject, y = log10(Value) ) ) +
facet_wrap( ~Variable, ncol = 1 )
Fit using ARIMA:
fit <- forecast::auto.arima( dfQuery$Cases )
fit
Series: dfQuery$Cases
ARIMA(0,2,0)
sigma^2 estimated as 3119674: log likelihood=-720.54
AIC=1443.08 AICc=1443.13 BIC=1445.47
Plot “forecast”:
plot( forecast::forecast(fit, h = 20) )
grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted")
Fit with ARIMA:
fit <- forecast::auto.arima( dfQuery$Deaths )
fit
Series: dfQuery$Deaths
ARIMA(1,2,0)
Coefficients:
ar1
-0.3135
s.e. 0.1100
sigma^2 estimated as 18846: log likelihood=-513.17
AIC=1030.33 AICc=1030.49 BIC=1035.12
Plot “forecast”:
plot( forecast::forecast(fit, h = 20) )
grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted")